Abstract

A main problem of "Follow the Perturbed Leader" strategies for online decision
problems is that regret bounds are typically proven against an oblivious
adversary. In partial observation cases, it was not clear how to obtain performance
guarantees against an adaptive adversary without worsening the bounds.
We propose a conceptually simple argument to resolve this problem. Using
this, a regret bound of $O(t^{2/3})$ for FPL in the adversarial multi-armed bandit
problem is shown. This bound holds for the common FPL variant using only
the observations from designated exploration rounds. Using all observations
allows for the stronger bound of $O(\sqrt{t})$, matching the best bound known so
far (and essentially the known lower bound) for adversarial bandits. Surprisingly,
this variant does not even need explicit exploration; it is self-stabilizing.
However, the sampling probabilities have to be either externally provided or
approximated to sufficient accuracy, using $O(t^2 \log t)$ samples in each step.

Keywords: expert advice, online algorithms, partial observations, adaptive adversary,
bandit problems, FPL

1 Introduction 

"Expert Advice" stands for an active research area which studies online algorithms. 
In each time step t = 1,2,3,... the master algorithm, henceforth called master for 
brevity, is required to commit to a decision, which results in some cost. The master 
has access to a class of experts, each of which suggests a decision at each time step. 
The goal is to design master algorithms such that the cumulative regret (which is 
just the cumulative excess cost) with respect to any expert is guaranteed to be small. 

*This work was supported by JSPS 21st century COE program C01. 
Bounds on the regret are typically proven in the worst case, i.e. without any statistical
assumption on the process assigning the experts' costs. In particular, this might be an 
adaptive adversary which aims at maximizing the master's regret and also knows the 
master's internal algorithm. This implies that (unless the decision space is continuous 
and the cost function is convex) the master must randomize in order to protect against 
this danger. 

In the recent past, a growing number of different but related online problems have 
been considered. Prediction of a binary sequence with expert advice has been popular 
since the work of Littlestone and Warmuth in the early 1990's. Freund and Schapire
[FS97] removed the structural assumption on the decision space and gave a very
general algorithm called Hedge which in each time step randomly picks one expert
and follows its recommendation. We will refer to this setup as the online decision
problem. Auer et al. [ACBFS95, ACBFS03] considered the first partial observation
case, namely the bandit setup, where in each time step the master algorithm only 
learns its own cost, i.e. the cost of the selected expert. All these and many other 
papers are based on weighted forecasting algorithms. 

A different approach, Follow the Perturbed Leader (FPL), was pioneered as early 
as 1957 by Hannan [Han57] and rediscovered recently by Kalai and Vempala [KV03].
Compared to weighted forecasters, FPL has two main advantages and one major
drawback. First, it applies to the online decision problem and admits a much more
elegant analysis for adaptive learning rate [HP05]. Even infinite expert classes do
not cause much complication. (However, the leading constant of the regret bound is
generically a factor of $\sqrt{2}$ worse than that for weighted forecasters.) Adaptive learning
rate is necessary unless the total number of time steps to be played is known in 
advance. 

As a second advantage, FPL also admits efficient treatment of cases where the 
expert class is potentially huge but has a linear structure [MB04, AK04]. We will
refer to such problems as geometric online optimization. An example is the online 
shortest path problem on a graph, where the set of admissible paths = experts is 
exponential in the number of vertices, but the cost of each path is just the sum of the 
costs of the vertices. 

FPL's main drawback is that its general analysis only applies against an oblivious 
adversary, that is an adversary that has to decide on all cost vectors before the game 
starts - as opposed to an adaptive one that before each time step t just needs to 
commit to the current cost vector. For the full information game, one can show 
that a regret bound against an oblivious adversary implies the same bound against an
adaptive one [HP05]. The intuition is that FPL's current decision at time t does not
depend on its past decisions. Therefore, the adversary may well decide on the current
cost vector before knowing FPL's previous decisions. This argument does not apply in
partial observation cases, as there FPL's behavior does depend on its past decisions
(because the observations do so). As a consequence, authors started to explicitly
distinguish between oblivious and adaptive adversaries, sometimes restricting to the
former, sometimes obtaining bounds of lower quality for the latter. E.g. McMahan
and Blum [MB04] suggest a workaround, proving sublinear regret bounds against an
adaptive bandit, however of worse order ($t^{3/4}\sqrt{\log t}$ instead of $t^{2/3}$, for both
geometric online optimization and the online decision problem). This is not satisfactory, since in
the case of the bandit online decision problem, even an $O(\sqrt{t})$ bound against an
adaptive adversary is known for a suitable weighted forecaster [ACBFS03].

In this work, we remove FPL's major drawback. We give a simple argument
(Section 2) which shows that also in the case of partial observation, a bound for FPL
against an oblivious adversary implies the same bound for an adaptive adversary. This
allows us in particular to prove an $O\big((tn\sqrt{\log n})^{2/3}\big)$ bound for the bandit online decision
problem (Section 3). This bound is shown for the common construction where only
the observations of designated exploration rounds are used. As this master algorithm
is label efficient, the bound is essentially sharp. In contrast, using all information
will enable us to prove a stronger $O(\sqrt{tn\log n})$ bound (Section 4). This matches
the best bound known so far for the adversarial bandit problem [ACBFS03], which
is sharp within $\sqrt{\log n}$. The downside of this algorithm is that either the sampling
probabilities have to be given by an oracle, or they have to be approximated to
sufficient accuracy, using $O(t^2\log t)$ samples. The case of an infinite expert class is
briefly discussed in Section 5.



2 FPL: oblivious $\Rightarrow$ adaptive

Assume that $c_1, c_2, \ldots \in [0,1]^n$ is a sequence of cost vectors. There are $n > 1$ experts.
(We will give an example with infinitely many experts in Section 5, but for simplicity
of presentation, we restrict our main exposition to finite expert classes.) That is, $c_t^i$
is expert $i$'s cost at time $t$, and the costs are bounded (w.l.o.g. in $[0,1]$). In the full
observation game, at time $t$ the master would know the past cumulative costs
$c_{<t} = c_{1:t-1} = \sum_{s=1}^{t-1} c_s$ (observe that we have introduced some notation here). However, our
focus is on partial observations, where this is not the case. Hence, assume that there
are estimates $\hat c_t$ (to be specified later) for the cost vectors $c_t$. Then at time $t$, FPL(t)
samples a perturbation vector $q_t \in [0,\infty)^n$, the components of which are independently
exponentially distributed, that is, $P(q_t^i > x) = e^{-x}$. Afterwards, the expert with the
best (minimum) score $\hat c_{<t}^i - q_t^i/\eta_t$ is selected, where $\eta_t > 0$ is the learning rate:

$$\mathrm{FPL}(t, \hat c_{<t}) = \arg\min_{1 \le i \le n} \Big\{ \hat c_{<t}^i - \frac{q_t^i}{\eta_t} \Big\} \quad \text{where } q_t^i \sim \mathrm{Exp} \text{ independently.} \qquad (1)$$
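The selection rule (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fpl_choice` and its arguments are hypothetical, and the cumulative estimated costs are simply passed in as a list.

```python
import random

def fpl_choice(cum_est_costs, eta, rng=random):
    """Sketch of rule (1): return the index i minimizing
    cum_est_costs[i] - q[i] / eta, with q[i] ~ Exp(1) drawn independently."""
    scores = []
    for cost in cum_est_costs:
        q = rng.expovariate(1.0)      # exponential perturbation, P(q > x) = exp(-x)
        scores.append(cost - q / eta)
    return min(range(len(scores)), key=scores.__getitem__)
```

For a very large learning rate the perturbation becomes negligible and the sketch reduces to following the (unperturbed) leader; for a small learning rate the choice is close to uniform.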

Denote the expert FPL chooses at time $t$ by $I_t = \mathrm{FPL}(t, \hat c_{<t})$. Then an adaptive
adversary is a function $A : [0,1]^{n \times (t-1)} \times \{1 \ldots n\}^{t-1} \to [0,1]^n$. (We assume $A$ to
be deterministic but remark that all our results and proofs hold for randomized $A$
without major modification.) The complete game between FPL and $A$ is specified by
$c_t = A(c_1, c_2, \ldots, c_{t-1}, I_1, I_2, \ldots, I_{t-1})$ and $I_t = \mathrm{FPL}(t, \hat c_{<t})$ for $t = 1, 2, \ldots$ The estimated
cost vector $\hat c_t$ is revealed to FPL after time $t$ and specified by a mechanism "outside"
this game which is defined later (this is the exploration).
After the game has proceeded for a number of time steps $T$, we want to evaluate
FPL's performance. Actually, the expected performance is the right quantity to
address. If we are rather interested in high probability bounds on the actual performance,
then they are easily obtained by observing that the difference of actual
to expected performance is a martingale with bounded differences (all instantaneous
costs $c_t^i$ are in $[0,1]$). Thus, high probability bounds follow by Azuma's inequality, as
we will demonstrate in Proposition 3.

How can we compute FPL's expected costs $E c_{1:T}^{\mathrm{FPL}} = E \sum_{t=1}^T c_t^{I_t}$? The key observation
is that - on the cost vectors generated by FPL and $A$ and with the given
estimated costs $\hat c_t$ - FPL's expected costs at time $t$ are the same as the expected costs
of another algorithm, $\widetilde{\mathrm{FPL}}$, which is defined by

$$\widetilde{\mathrm{FPL}}(t, \hat c_{<t}) = \arg\min_{1 \le i \le n} \Big\{ \hat c_{<t}^i - \frac{q_*^i}{\eta_t} \Big\}, \qquad (2)$$

where $q_*$ is a single fixed vector with independently exponentially distributed components.
Since we have to be careful to take expectations w.r.t. the appropriate
randomness, we explicitly refer to the randomness in the notation by writing e.g.
$E c_t^{\mathrm{FPL}} = E_{q_t} c_t^{\mathrm{FPL}}$. Then the following statement trivially holds, as $q_t$ and $q_*$ have the
same distribution.

Proposition 1 At each time $t \le T$, we have $E_{q_t} c_t^{\mathrm{FPL}} = E_{q_*} c_t^{\widetilde{\mathrm{FPL}}}$.

This means that in order to analyze FPL, we may now proceed by considering the
expected costs of $\widetilde{\mathrm{FPL}}$ instead. We can use the standard analysis based on the tools
by Kalai and Vempala [KV03], which requires that $\widetilde{\mathrm{FPL}}$ is executed on a sequence of
cost vectors that is fixed and not known in advance. Actually, in contrast to the full
observation game analysis, the bandit analysis will never require the true cost vectors
to be revealed, but rather the estimated cost vectors. For the cost vectors generated
by $A$ in response to FPL, the prerequisite for $\widetilde{\mathrm{FPL}}$ is satisfied - just consider $\widetilde{\mathrm{FPL}}$ as
a virtual or hypothetical algorithm which is not actually executed. Therefore it does
not make any decisions or cause any response from the adversary. Just for the sake
of analysis we pretend that it runs and evaluate the expected cost it incurs, which is
the same as that of FPL.

Since our key argument and the way it is used in the analysis appear quite subtle
at first glance, we encourage the reader to thoroughly verify each of the subsequent
formal steps.



3 The standard strategy against adversarial bandits







For t = 1, 2, 3, ...
    set $\hat c_t^i = 0$ for all $i$
    sample $r_t \in \{0, 1\}$ independently s.t. $P[r_t = 1] = \gamma_t$
    If $r_t = 0$ Then set $I_t^b = \mathrm{FPL}(t, \hat c_{<t})$ according to (1)
    If $r_t = 1$ Then sample $I_t^b$ from $\{1 \ldots n\}$ uniformly ($I_t^b = u_t$)
    play decision $I_t^b$ and observe cost $c_t^{I_t^b}$
    If $r_t = 1$ Then set $\hat c_t^{I_t^b} = n \cdot c_t^{I_t^b} / \gamma_t$

Figure 1: The algorithm bFPL. The exploration rate $\gamma_t$ and the learning rate $\eta_t$ (used
by subroutine FPL) will be specified in Theorem 2.

The first algorithm we consider, bandit-FPL (bFPL), is specified in Figure 1 and proceeds
as follows. At time $t$, it decides whether to perform an exploration or an exploitation
step according to some exploration probability $\gamma_t \in (0, 1)$. This is realized by sampling
$r_t \in \{0, 1\}$ independently from all other randomness with $P[r_t = 1] = \gamma_t$. In case of
exploration ($r_t = 1$), the decision $I_t^b$ is uniformly sampled from $\{1 \ldots n\}$, independently
from all other randomness. We denote this choice by $u_t$. (For notational convenience,
we will also refer to the irrelevant $u_t$s in the exploitation steps later.) In case of
exploitation ($r_t = 0$), bFPL obtains its decision $I_t^b$ by invoking FPL according to (1).
After bFPL has played its decision, it observes its own cost $c_t^{I_t^b}$. Finally, only in case
of exploration ($r_t = 1$), the estimated cost vector is set to something different from
0. This is the standard way of constructing an FPL variant against an adversarial
bandit [MB04, AK04]. We will discuss how to make use of all observations in the
next section. Here is the formal specification of the algorithm again.

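$$I_t^b = \mathrm{bFPL}(t, \hat c_{<t}) = \begin{cases} u_t & \text{if } r_t = 1 \\ \mathrm{FPL}(t, \hat c_{<t}) & \text{otherwise,} \end{cases} \qquad \hat c_t^i = \begin{cases} c_t^i\, n / \gamma_t & \text{if } r_t = 1 \wedge i = I_t^b \\ 0 & \text{otherwise.} \end{cases}$$

Consequently, the estimated cost vector is chosen unbiasedly, i.e. $E_{r_t, u_t} \hat c_t^i = c_t^i$. This
technique was introduced in [ACBFS95].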
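One round of the scheme described above might be sketched as follows. This is a minimal illustration rather than the author's implementation: the function name `bfpl_step`, the in-place update of the cumulative estimates, and the explicit `rng` argument are assumptions made for the example.

```python
import random

def bfpl_step(cum_est, costs, gamma, eta, rng):
    """One round of a bFPL sketch.

    cum_est : cumulative estimated costs so far (updated in place)
    costs   : true cost vector of this round (only costs[choice] is "observed")
    Returns (choice, incurred cost, estimated cost vector of this round).
    """
    n = len(costs)
    explore = rng.random() < gamma           # r_t = 1 with probability gamma
    if explore:
        choice = rng.randrange(n)            # u_t, uniform over the experts
    else:
        # exploitation: FPL as in (1) on the cumulative estimated costs
        scores = [cum_est[i] - rng.expovariate(1.0) / eta for i in range(n)]
        choice = min(range(n), key=scores.__getitem__)
    estimate = [0.0] * n
    if explore:
        # importance weighting: exploration picks expert i with probability gamma / n
        estimate[choice] = n * costs[choice] / gamma
    for i in range(n):
        cum_est[i] += estimate[i]
    return choice, costs[choice], estimate
```

Averaging the returned estimate vectors over many rounds with a fixed cost vector should approach that cost vector, which is the unbiasedness property stated above.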

Theorem 2 Let $\gamma_t = \min\{1, t^{-1/3} (n\sqrt{\log n})^{2/3}\}$ and $\eta_t = \frac{\gamma_t}{n^2}\, t^{-1/3} (n\sqrt{\log n})^{2/3}$. Then, for
any $T \ge (n \log n)^2$, each expert $i \in \{1 \ldots n\}$, and any adaptive assignment of the
costs $c_1, c_2, \ldots$, bFPL satisfies the regret bound

$$E c_{1:T}^{\mathrm{bFPL}} - c_{1:T}^i \le 4\, (T n \sqrt{\log n})^{2/3}. \qquad (3)$$

(For $T < (n \log n)^2$, the regret is clearly at most $(n \log n)^2$.)

Proof. All computations we use in the subsequent proof have been taken or adapted
from other work. Our point is to bring them into the right order and to carefully
check that in this context, against an adaptive adversary, all operations are legitimate.
In particular we have to take care that all expectations are w.r.t. the appropriate
randomness. Again, we make this explicit in the notation and write e.g.
$E c_t^{\mathrm{bFPL}} = E_{q_t, r_{1:t}, u_{1:t}} c_t^{\mathrm{bFPL}}$. Note that according to the definition of bFPL, $E c_t^{\mathrm{bFPL}}$ in
fact does not depend on $q_{<t}$. During the proof, we will avoid the use of unspecified
expectation (without subscripts). Let us introduce the abbreviation $h_{<t} = (r_{<t}, u_{<t}, q_{<t})$
for the randomization history, i.e. the tuple containing all past random variables.

Moreover, we will use conditional expectation. For instance, $E_{q_t}[c_t^{\mathrm{FPL}} \mid h_{<t}]$ denotes a
random variable depending on the randomization history $h_{<t}$, where for each possible
history the expectation is taken w.r.t. $q_t$. Since we admit adaptive assignments, we
must be aware that they may depend on bFPL's past randomness. To make this
explicit, we use the notation $E[c_t^i \mid h_{<t}]$ for the adversary's decisions and rewrite our
bound to show (3) as

$$\sum_{t=1}^T E_{q_t, r_t, u_t}[c_t^{\mathrm{bFPL}} \mid h_{<t}] - \sum_{t=1}^T E[c_t^i \mid h_{<t}] \le 4\, (T n \sqrt{\log n})^{2/3}. \qquad (4)$$

In order to keep the presentation simple, we assume the adversary to be deterministic.
Then for given randomization history, $c_t^i$ is constant. The same proof (and hence the
theorem) remains valid if we admit randomized adversaries.

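First note that $E_{q_t, r_t, u_t}[c_t^{\mathrm{bFPL}} \mid h_{<t}] \le E_{q_t}[c_t^{\mathrm{FPL}} \mid h_{<t}] + \gamma_t$ holds in each time step $t$ by
definition of bFPL and $c_t^i \le 1$. Since $\gamma_t \le t^{-1/3} (n\sqrt{\log n})^{2/3}$, we have

$$\sum_{t=1}^T \gamma_t \le \sum_{t=1}^T t^{-1/3} (n\sqrt{\log n})^{2/3} \le \tfrac{3}{2}\, (T n \sqrt{\log n})^{2/3}. \qquad (5)$$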
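The last inequality in (5) rests on a standard integral comparison for the decreasing function $x^{-1/3}$. As a short worked check (not part of the original argument, only spelled out here for convenience):

```latex
\sum_{t=1}^{T} t^{-1/3} \;\le\; \int_0^T x^{-1/3}\,dx \;=\; \tfrac{3}{2}\,T^{2/3},
\qquad\text{hence}\qquad
\sum_{t=1}^{T} t^{-1/3}\,(n\sqrt{\log n})^{2/3} \;\le\; \tfrac{3}{2}\,(Tn\sqrt{\log n})^{2/3}.
```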

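Therefore, (4) follows from

$$\sum_{t=1}^T E_{q_t}[c_t^{\mathrm{FPL}} \mid h_{<t}] - \sum_{t=1}^T E[c_t^i \mid h_{<t}] \le \tfrac{5}{2}\, (T n \sqrt{\log n})^{2/3}. \qquad (6)$$

Consider this form of FPL (i.e. FPL executed in each time step) as a virtual algorithm:
It does not run in that way on the inputs. Rather, for the sake of analysis, we
pretend that it runs with the $\hat c_t$ obtained from bFPL and try to evaluate its (virtual)
performance.

We then use Proposition 1 to bring into play another virtual algorithm, namely
$\widetilde{\mathrm{FPL}}$. Since for given randomization history, the expected performance of FPL and
$\widetilde{\mathrm{FPL}}$ coincide, (6) is proven if we can show

$$\sum_{t=1}^T E_{q_*}[c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}] - \sum_{t=1}^T E[c_t^i \mid h_{<t}] \le \tfrac{5}{2}\, (T n \sqrt{\log n})^{2/3}. \qquad (7)$$

Next, we perform the transition from real to estimated costs. Since the estimate $\hat c$
was defined to be unbiased, we have $E[c_t^i \mid h_{<t}] = E_{r_t, u_t}[\hat c_t^i \mid h_{<t}]$. By the same argument,
since the choice of $\widetilde{\mathrm{FPL}}$ actually does not depend on $r_t$ and $u_t$,
$E_{q_*}[c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}] = E_{q_*, r_t, u_t}[\hat c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}]$ holds. Hence, (7) follows from

$$\sum_{t=1}^T E_{q_*, r_t, u_t}[\hat c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}] - \sum_{t=1}^T E_{r_t, u_t}[\hat c_t^i \mid h_{<t}] \le \tfrac{5}{2}\, (T n \sqrt{\log n})^{2/3}. \qquad (8)$$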
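Note that, somewhat curiously, $\widetilde{\mathrm{FPL}}$ (like FPL) only incurs estimated costs in case
of exploration, i.e. where it actually did not decide the action. We need yet another
virtual algorithm, infeasible FPL or IFPL, defined as

$$\mathrm{IFPL}(t, \hat c_{1:t}) = \arg\min_{1 \le i \le n} \Big\{ \hat c_{1:t}^i - \frac{q_*^i}{\eta_t} \Big\}, \qquad (9)$$

which uses the same perturbation $q_*$ as $\widetilde{\mathrm{FPL}}$. It is not feasible because at time $t$
it makes use of the information $\hat c_t$, which is only available afterwards. As it is a
virtual algorithm, this does not cause any problems. By [HP05, Theorem 4], which
is proven by an argument very similar to (13) below, in case of exploration (i.e.
$r_t = 1$) it holds that $E_{q_*}[\hat c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}, r_t = 1] \le E_{q_*}[\hat c_t^{\mathrm{IFPL}} \mid h_{<t}, r_t = 1] + \eta_t \big(\frac{n}{\gamma_t}\big)^2$. We
remark that this step is valid also for independently sampled perturbations $q_t$. Clearly,
$E_{q_*}[\hat c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}, r_t = 0] = E_{q_*}[\hat c_t^{\mathrm{IFPL}} \mid h_{<t}, r_t = 0]$ in case of exploitation ($r_t = 0$), since
then both estimated costs vanish. Thus in expectation w.r.t. $q_*$ and $r_t$, and for any $u_t$,

$$E_{q_*, r_t}[\hat c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}] \le E_{q_*, r_t}[\hat c_t^{\mathrm{IFPL}} \mid h_{<t}] + \frac{\eta_t n^2}{\gamma_t}.$$

The sum over $\frac{\eta_t n^2}{\gamma_t} \le t^{-1/3} (n\sqrt{\log n})^{2/3}$ is bounded as in (5), and we see that (8) holds
if we can show

$$\sum_{t=1}^T E_{q_*, r_t, u_t}[\hat c_t^{\mathrm{IFPL}} \mid h_{<t}] - \sum_{t=1}^T E_{r_t, u_t}[\hat c_t^i \mid h_{<t}] \le (T n \sqrt{\log n})^{2/3}. \qquad (10)$$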

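The rest of the proof now follows as in [KV03] or [HP05]. In order to maintain
self-containedness, we give it here. Actually we verify (10) for any choice of $r_{1:T}, u_{1:T}$,
then it also holds in expectation.

In the following, we suppress the dependency on $r_{1:T}, u_{1:T}$ in the notation. Then all
expectations are w.r.t. $q_*$. We use the following convenient notation from [KV03]: For
a vector $x \in \mathbb{R}^n$, let $M(x)$ be the unit vector which has a 1 at the index $\arg\min_i\{x^i\}$
and 0's at all other places. Then the process of selecting a minimum can be written as a
scalar product: $\min_i\{x^i\} = M(x) \circ x$. For convenience, let $\eta_0 = \infty$ and $\tilde c_{1:t} = \hat c_{1:t} - \frac{q_*}{\eta_t}$.
Then it is easy to prove by induction [KV03, HP05] that

$$\sum_{t=1}^T M(\tilde c_{1:t}) \circ \hat c_t - \sum_{t=1}^T M(\tilde c_{1:t}) \circ q_* \Big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Big) \le M(\tilde c_{1:T}) \circ \tilde c_{1:T}. \qquad (11)$$

In order to estimate $E \hat c_{1:T}^{\mathrm{IFPL}}$, we take expectations on both sides. Then observe
$E M(\tilde c_{1:T}) \circ \tilde c_{1:T} \le E M(\hat c_{1:T}) \circ \tilde c_{1:T} = \min_i\{\hat c_{1:T}^i\} - E M(\hat c_{1:T}) \circ \frac{q_*}{\eta_T} \le \hat c_{1:T}^i - \frac{1}{\eta_T}$ by definition
of $M$. The negative term on the l.h.s. of (11) may be bounded by
$E \sum_{t=1}^T M(\tilde c_{1:t}) \circ q_* \big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \big) \le E \max_i\{q_*^i\} \sum_{t=1}^T \big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \big) = \frac{E \max_i\{q_*^i\}}{\eta_T} \le \frac{1 + \log n}{\eta_T}$
(see [KV03] or [HP05] for the last estimate). Plugging these estimates back into (11), while
observing that the terms $\pm\frac{1}{\eta_T}$ cancel and that $\eta_T = T^{-2/3} \big( \frac{\log n}{n} \big)^{2/3}$ (which holds
because of $T \ge (n \log n)^2$), finally shows (10) and
concludes the proof of the theorem. □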
Proposition 3 (High probability bound) For each $T \ge 1$ and $0 < \delta < 1$, the actual
costs of bFPL are bounded with probability at least $1 - \delta$ by

$$c_{1:T}^{\mathrm{bFPL}} \le E c_{1:T}^{\mathrm{bFPL}} + \sqrt{2 T \log \tfrac{2}{\delta}}.$$

Proof. Again we use the explicit notation from the proof of the previous theorem.
It is easy to see that the sequence of random variables
$X_T = \sum_{t=1}^T \big( c_t^{\mathrm{bFPL}} - E_{r_t, u_t, q_t}[c_t^{\mathrm{bFPL}} \mid h_{<t}] \big)$ is a martingale w.r.t. the filtration of sigma-algebras generated by the
randomization history $h_{1:t}$. Moreover, its differences are bounded by $|X_t - X_{t-1}| \le 1$.
Consequently, by Azuma's inequality, the probability that $X_T$ exceeds some $\lambda > 0$ is
bounded by $\delta = 2 \exp\big( -\frac{\lambda^2}{2T} \big)$. Solve this for $\lambda$ to obtain the assertion. □
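Solving $\delta = 2\exp(-\lambda^2/(2T))$ for $\lambda$ gives $\lambda = \sqrt{2T\log(2/\delta)}$. A small numerical sketch of this step (the helper name `azuma_deviation` is hypothetical and only serves the illustration):

```python
import math

def azuma_deviation(T, delta):
    """Return lambda solving delta = 2 * exp(-lambda**2 / (2 * T)),
    i.e. the deviation term of the high probability bound."""
    if T < 1 or not (0.0 < delta < 1.0):
        raise ValueError("need T >= 1 and 0 < delta < 1")
    return math.sqrt(2.0 * T * math.log(2.0 / delta))
```

The deviation grows like $\sqrt{T}$, so it is of lower order than the $T^{2/3}$ regret bound of Theorem 2.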



4 Using all observations 

The algorithm bFPL considered so far uses only a $\gamma_t$-fraction of all the input.
It is thus a label efficient decision maker [CBLS04a, CBLS04b]. One possible way
to specify a label efficient problem setup is to require that the master usually does
not observe anything, and it incurs maximal cost if it decides to observe something
[CBLS04b]. Since just before (5), we upper bounded the costs in case of exploration
by 1, it is immediate that the same analysis and hence also Theorem 2 transfer to the
label efficient case. [CBLS04b, Sec. 5] prove that there is a label efficient prediction
problem such that any forecaster incurs a regret proportional to $t^{2/3}$. Hence the bound
in Theorem 2 is essentially sharp for bFPL.

Of course, the usual bandit setup does not require the master to make use of only
a tiny fraction of all information available. For weighted forecasters, it is very easy to
produce an unbiased cost estimate if each round's inputs are used. It turns out that
then a regret bound proportional to $\sqrt{t}$ can be obtained [ACBFS03]. Unfortunately this
is different for FPL, as here the sampling probabilities are not explicitly available. In
the following, we will first discuss the computationally infeasible case assuming that
we know the sampling probabilities. After that, we show how to approximate them
by a Monte Carlo simulation to sufficient accuracy.

Surprisingly, it is possible to work with the plain FPL algorithm from (1) without
exploration. We just have to use the correct estimated cost vectors,

$$\hat c_t^i = \begin{cases} c_t^i / P(I_t^{\mathrm{FPL}} = i) & \text{if } i = I_t^{\mathrm{FPL}} \\ 0 & \text{otherwise,} \end{cases} \qquad (12)$$

where $I_t^{\mathrm{FPL}}$ was FPL's choice at time $t$. We assume that the values $P(I_t^{\mathrm{FPL}} = i)$ are
provided by some oracle.
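The estimate (12) is an importance-weighted estimate. The following sketch illustrates it under the assumption that the oracle probabilities are simply handed over as a list `probs` with strictly positive entries; both function names are hypothetical.

```python
def estimated_cost_vector(chosen, observed_cost, probs):
    """Estimate as in (12): observed cost divided by the probability of
    the chosen expert, zero for every other expert."""
    estimate = [0.0] * len(probs)
    estimate[chosen] = observed_cost / probs[chosen]
    return estimate

def expected_estimate(costs, probs):
    """Exact expectation of the estimate when the chosen expert is
    distributed according to probs (enumeration over all choices)."""
    n = len(costs)
    mean = [0.0] * n
    for chosen in range(n):
        estimate = estimated_cost_vector(chosen, costs[chosen], probs)
        for i in range(n):
            mean[i] += probs[chosen] * estimate[i]
    return mean
```

The enumeration in `expected_estimate` recovers the true cost vector, which is the sense in which the estimate is "correct" when the choice and the estimate are generated from the same sampling distribution.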

It is not hard to adapt the proof of Theorem 2 to analyze FPL under these conditions.
As in the steps up to (8),

$$E_{q_t}[c_t^{\mathrm{FPL}} \mid h_{<t}] = E_{q_*, q_t, r_t, u_t}[\hat c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}] = E_{q_*, q_t, r_t, u_t}[\hat c_t^{\widetilde{\mathrm{FPL}}(q_*)}(q_t) \mid h_{<t}].$$

The overly explicit notation $\hat c_t^{\widetilde{\mathrm{FPL}}(q_*)}(q_t)$ serves to remind that the estimated cost vector
is obtained using $q_t$, while $\widetilde{\mathrm{FPL}}$'s choice incurring cost stems from $q_*$. It is essential
that $q_t$ and $q_*$ are independent. Observe that in general,
$E_{q_*, q_t, r_t, u_t}[\hat c_t^{\widetilde{\mathrm{FPL}}(q_*)}(q_t) \mid h_{<t}] \ne E_{q_t, r_t, u_t}[\hat c_t^{\mathrm{FPL}(q_t)}(q_t) \mid h_{<t}]$: the latter quantity, which is the actual estimated cost of FPL's
choice, is biased and too large.

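Abbreviate $p^i = P(I_t^{\mathrm{FPL}} = i)$ and $\pi^i = P(I_t^{\mathrm{IFPL}} = i)$. Denote the exponential
distribution by $\mu$ and integration with respect to $q^1 \ldots q^n$ without the $i$th coordinate
by $\int \ldots d\mu(q^{\ne i})$. Moreover, for $x \in \mathbb{R}$, let $x^+ = \max\{x, 0\}$. Then, similarly to the
proof of [HP05, Theorem 4], and using $0 \le \hat c_t^i \le 1/p^i$,

$$p^i = \int \int_{\max_{j \ne i}\{\eta_t(\hat c_{<t}^i - \hat c_{<t}^j) + q^j\}}^{\infty} d\mu(q^i)\, d\mu(q^{\ne i}) = \int e^{-(\max_{j \ne i}\{\eta_t(\hat c_{<t}^i - \hat c_{<t}^j) + q^j\})^+} d\mu(q^{\ne i}) \qquad (13)$$

$$\le \int e^{\eta_t / p^i}\, e^{-(\max_{j \ne i}\{\eta_t(\hat c_{1:t}^i - \hat c_{1:t}^j) + q^j\})^+} d\mu(q^{\ne i}) = e^{\eta_t / p^i}\, \pi^i.$$

Hence, $\pi^i \ge p^i e^{-\eta_t / p^i} \ge p^i \big( 1 - \frac{\eta_t}{p^i} \big) = p^i - \eta_t$, which implies

$$E_{q_*, q_t}[\hat c_t^{\mathrm{IFPL}} \mid h_{<t}] = \sum_{i=1}^n p^i\, \pi^i\, \frac{c_t^i}{p^i} \ge \sum_{i=1}^n (p^i - \eta_t)\, c_t^i \ge \sum_{i=1}^n p^i c_t^i - n \eta_t = E_{q_*, q_t}[\hat c_t^{\widetilde{\mathrm{FPL}}} \mid h_{<t}] - n \eta_t,$$

where in the $i$th summand $\pi^i$ refers to the case that expert $i$ has been selected at time $t$.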

This shows the step from feasible to infeasible FPL. The last step from infeasible
FPL to the best decision in hindsight proceeds as shown already above and in [KV03,
HP05]. Like before, it causes the upper bound of the cumulative regret to increase by
$\frac{\log n}{\eta_T}$. This is true for any $(q_{1:T}, r_{1:T}, u_{1:T})$, hence also in expectation. The total regret
is thus upper bounded by $\frac{\log n}{\eta_T} + n \sum_{t=1}^T \eta_t$, and we have just proved:

Theorem 4 The algorithm FPL (1), obtaining cost estimates according to (12) and
with learning rate $\eta_t = \sqrt{\frac{\log n}{2tn}}$, achieves a regret of at most

$$E c_{1:T}^{\mathrm{FPL}} - c_{1:T}^i \le 2\sqrt{2 T n \log n} \quad \text{for any } i \in \{1 \ldots n\}. \qquad (14)$$
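The arithmetic behind (14) can be checked numerically. The sketch below assumes that the two contributions to the regret are $(\log n)/\eta_T$ for the step to the best expert in hindsight and $n\sum_t \eta_t$ for the step from feasible to infeasible FPL, with $\eta_t = \sqrt{\log n/(2tn)}$; the function name is hypothetical.

```python
import math

def fpl_regret_terms(T, n):
    """Return (hindsight term, infeasible-step term, claimed bound)
    for learning rate eta_t = sqrt(log(n) / (2 * t * n))."""
    def eta(t):
        return math.sqrt(math.log(n) / (2.0 * t * n))
    hindsight = math.log(n) / eta(T)                      # (log n) / eta_T
    infeasible = n * sum(eta(t) for t in range(1, T + 1)) # n * sum of eta_t
    bound = 2.0 * math.sqrt(2.0 * T * n * math.log(n))
    return hindsight, infeasible, bound
```

Each of the two terms is at most $\sqrt{2Tn\log n}$ (the second because $\sum_{t \le T} t^{-1/2} \le 2\sqrt{T}$), which is how the constant $2\sqrt{2}$ arises under these assumptions.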



We would like to point to a quite remarkable symmetry break here. It is straightforward
to formulate FPL and the preceding analysis for reward maximization
instead of cost minimization. Then the (perturbed) leader is the expert with the
highest (perturbed) reward, and perturbations are added to the scores. In the full
information game, this reward maximization is perfectly symmetric to cost minimization
by just setting $\mathrm{reward}_t^i = 1 - c_t^i$: all probabilities, distributions, and outcomes
will be exactly the same. This is different in the partial observation case: There,
in case of reward, the expert selected by FPL is the only one which can gain score. This
is an advantage for that expert, in contrast to the disadvantage in case of loss minimization: Here,
the selected expert is the only one to worsen its score. Put differently, there is an
automatic exploration or self-stabilization in the cost minimization case. With this
intuition, it is less surprising that we did not need explicit exploration in Theorem 4.
The corresponding result for reward maximization would not hold, as simple
counterexamples show. Formally, it is the step from FPL to infeasible FPL which fails: A
computation similar to (13) only shows $\pi^i \le p^i e^{\eta_t / p^i}$, which does not imply a sufficiently
strong assertion in general. However, reintroducing the exploration rate $\gamma_t$, we may



set rjt = — . This implies < 1 for all i, hence < 1 + 2^. Letting 7$ = 



we can conclude a bound like (jUj) . 
4.1 A computationally feasible algorithm 

We conclude this section by discussing a computationally feasible variant of FPL using
all observations. This algorithm is constructed in a straightforward way: Select the
current action $i = I_t^{\mathrm{FPL}}$ according to FPL and substitute the estimate $\hat c_t^i$ from (12) by
$\hat c_t^i = c_t^i / \hat p_t^i$. It remains to estimate $p_t^i$ by a Monte Carlo simulation.

There are two possibilities of error: either $\hat p_t^i$ overestimates $p_t^i$, or it underestimates
$p_t^i$. The respective consequences are different: If $\hat p_t^i > p_t^i$, then the instantaneous cost
of the selected expert is just underestimated. We can account for this by adding a
small correction to the instantaneous regret. At the end of the game, we perform well
with respect to the underestimated costs, which are upper bounded by the true costs.
This does not cause any further problems. The case $\hat p_t^i < p_t^i$ is more critical, since then
at the end of the game we perform well only w.r.t. overestimated costs. We therefore
have to treat this case more carefully.

Problems arise if the true probability $p_t^i$ is very close to 0, as then the Monte Carlo
sample might contain very few or no hits and the variance of the estimated cost is
high. Since FPL does not prevent this case, we reintroduce $\gamma_t$ as an "exploration
threshold". Let $\gamma_t = \frac{1}{2\sqrt{t}} \le \frac{1}{2}$. We first assume that $p_t^i \ge \gamma_t$. If this assumption is false
but we use $\hat p_t^i \ge \gamma_t$, then $\hat p_t^i$ is an overestimate and we have to consider an additional
instantaneous regret. This case has probability at most $\gamma_t$. Consequently, as (true)
instantaneous costs are always bounded by 1, the additional instantaneous regret is
at most $\gamma_t$.

We sample the perturbed leader $k \in \mathbb{N}$ times and denote by $a^i(k)$ the number
of times the leader happens to be expert $i$. Recall that expert $i$ is the one already
selected by FPL. By Hoeffding's inequality, the distribution of $a^i(k)/k$ is sharply peaked
around its mean $p_t^i$:
We choose k such that the probability bounds on the r.h.s. are at most 7$, i.e. e~ 7 * 4fc <
jt- Consequently we should sample k = |~7 t ~ 4 log(7 t -1 )] = |~2t 2 log(2\/t)] times. Hence 
the sampling complexity of the algorithm is $O(t^2 \log t)$. Let
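The Monte Carlo step itself, i.e. counting how often expert $i$ turns out to be the perturbed leader in $k$ independent draws, could be sketched as below. The function names are hypothetical, the number of samples $k$ is left as a parameter, and the thresholding and correction that the text applies to the raw frequency are deliberately not reproduced here.

```python
import random

def perturbed_leader(cum_est, eta, rng):
    """One draw of the perturbed leader for fixed cumulative estimated costs."""
    scores = [c - rng.expovariate(1.0) / eta for c in cum_est]
    return min(range(len(scores)), key=scores.__getitem__)

def hit_frequency(cum_est, eta, i, k, rng):
    """Raw Monte Carlo frequency a_i(k) / k of expert i being the leader."""
    hits = sum(1 for _ in range(k) if perturbed_leader(cum_est, eta, rng) == i)
    return hits / k
```

By Hoeffding's inequality the frequency concentrates around the true selection probability at rate $1/\sqrt{k}$, which is why the number of samples has to grow with $t$ when the required accuracy shrinks.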

-i r ai ( k ) it \ 

then pi < p\ with probability at least 1 — % (recall the assumption p\ > j t ). Hence 
the possibility of overestimate p\ > p\ causes an additional regret of j t - 

Finally we need to deal with possible underestimates. For some integer m > 1, the 
probability that p\ falls below p\ — ( - v/ "^ L - >7 is at most 



a'W _ J. <-" V / ™7 t 2 
k P - V2 



< e- m ^ k < 7™ (15) 



by Hoeffding's inequality. We partition the interval [jtiPf) of all possible underesti- 
mates into sub intervals A\ = \p\ — T^,Pt) an d 



A r . 



' i _ (y^+l)7 2 i _ (y^T+l) 7 2 \ > 9 

Pt V2 'Pt ^2 , m ^ z. 



We do not need to consider m with the property A m H [7t,p|) = 0. That is, we can 
restrict to m small enough that p\ - ^(V™ + l)7l > It ~ Let M be the 

largest m for which this condition is satisfied, then one can easily see sjm + 1 < 
y/M + 1 < V2(p - 7 t + \J\l 2 t)hl 

Claim 5 // m < M, toen ^^gi^ < | + 7*(v^ + 1) • 

This follows by a simple algebraic manipulation. Consequently, for p\ e A m , we 
have EcJ < c\ + + 1)74. Moreover, $ e A m occurs with probability at most 
7™ _1 according to (fTK)l . By bounding the expectation over all A m , we thus obtain an 
additional regret of at most 

M 00 _ o 

27t 7t 



m=l m=0 

since 7$ < |. Altogether, this proves the following theorem. 

Theorem 6 Let 7 t = ^= 6e i/ie exploration threshold. In each time step, after select- 
ing one expert i, let FPL obtain an estimate p\ = max | 7t , — ^} /or P(J t FPL = z), 
by sampling the perturbed leader $k = \lceil 2t^2 \log(2\sqrt{t}) \rceil$ times and counting the number
of hits $a^i(k)$. Let the estimated cost of the selected expert be $\hat c_t^i = c_t^i / \hat p_t^i$, and the
estimated cost of all other experts be zero. Then the algorithm FPL (1) with learning
rate $\eta_t = \sqrt{\frac{\log n}{2tn}}$ achieves a regret of at most

$$E c_{1:T}^{\mathrm{FPL}} - c_{1:T}^i \le 2\sqrt{2 T n \log n} + 7\sqrt{T} \quad \text{for any } i \in \{1 \ldots n\}. \qquad (16)$$
For t = 1, 2, 3, ...

set c\ = for i G {i : t > t 1 } and c\ = 0y t min{w l : t > r'}) for i £ {i : t > t 1 } 
sample r t G {0, 1} independently s.t. P[r t = 1] = 7* 

If r t = 0, set l\ = arg min i: t> T i \& <t + log ™~ qt } (FPL on the active experts) 
If r t = 1, sample J| G {i : £ > r 1 } according to the weights ^ w ' — r 

play decision J t and observe cost c t * 

If r t = 1, set cj* = [cf Ei^r^l/bi^*] 

Figure 2: The algorithm bFPL for infinite expert class. The entering times $\tau^i$, the
exploration rate $\gamma_t$, and the learning rate $\eta_t$ will be specified in Theorem 7.



5 Infinite expert classes 

Here, we sketch a variant of bFPL, taken from [PH05], with guaranteed worst-case
performance against a bandit with countably infinitely many arms. So we consider
the following setup: The adversary successively generates cost vectors $c_t \in [0, 1]^\infty$,
and at each time $t$ we have to select one index or expert $i$ and incur its cost $c_t^i$. We
learn only the cost of the selected expert.

As a prerequisite, we need that each of the infinitely many experts is associated
with a prior weight $w^i$ such that $\sum_i w^i \le 1$. Since in order to obtain a cost estimate $\hat c$,
the observed cost is divided by the weight of the sampled expert, we have to be careful
not to admit too small weights. We need to keep control of the maximum possible
expected cost, since otherwise the step from FPL to IFPL would be problematic. One
possibility to do so is defining an entering time $\tau^i \ge 1$ for each expert. Prior to
$\tau^i$, the expert is not active and cannot be chosen. We choose $\tau^i = \lceil (1/w^i)^{1/\alpha} \rceil$, with
$0 < \alpha < 1$ to be defined later. Then it is not hard to see that the minimum weight
of any active expert at time $t$ is lower bounded: $\min\{w^i : t \ge \tau^i\} \ge t^{-\alpha}$. Letting
the exploration rate be $\gamma_t = t^{-\beta}$ with $0 < \beta < 1$ to be defined later, the maximum
unbiasedly estimated cost is at most $t^{\alpha + \beta}$. For the step from FPL to IFPL to go
through, we thus may choose $\eta_t = t^{-2\alpha - 2\beta}$. Then both steps from bFPL to FPL
and from FPL to IFPL each cause a regret of at most $\sum_{t=1}^T \gamma_t \le \frac{1}{1-\beta} T^{1-\beta}$. On the
other hand, $\frac{1}{\eta_T}$ causes a regret of at most $T^{2\alpha + 2\beta}$. In order to minimize these bounds
simultaneously, we choose $\beta = \frac{1 - 2\alpha}{3}$.
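The choice of $\beta$ comes from balancing the two exponents. As a worked step, assuming the competing terms are $T^{1-\beta}$ and $T^{2\alpha+2\beta}$ as read from the paragraph above (and $\alpha < \frac{1}{2}$, so that $\beta > 0$):

```latex
1-\beta = 2\alpha+2\beta
\;\Longleftrightarrow\;
\beta = \frac{1-2\alpha}{3},
\qquad\text{so that}\qquad
T^{1-\beta} = T^{2\alpha+2\beta} = T^{\frac{2+2\alpha}{3}} .
```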

In order to guarantee that the step from IFPL to some fixed expert holds, we have
to correctly assign estimated costs to inactive experts. For example, if an expert enters
the game and previously has been assigned no estimated cost at all, then a bound
w.r.t. this expert may be difficult to obtain. We therefore assign maximum possible
estimated costs to all inactive experts. Then one can show [PH05] that, evaluating
the expected costs, the step from IFPL to some fixed reference expert holds almost without
modification. Clearly, the reference expert's estimated costs now exceed its true costs
by at most YH=\ t a+ ^, which is easily shown to be upper bounded by + 
This gives another additive bound to the regret in terms of the weight of the reference
expert - there is not multiplicative factor of i any more. This is an artifact of the 
design of the algorithm and proof technique and does not mean that the new variant 
performs better than the old one. Actually, since a > 0, the bound is now 
as opposed to 0(ts) before. Choosing a large a results in a small (i) term, but the 
order in t gets large, while a small a has the opposite effect. 

The complete algorithm is specified in Figure 2. The following statement, which
improves on the bounds given in [PH05] (they are based on the workaround from 
MB04J is an example where we select a = |. 

Theorem 7 Consider a bandit problem with countably many arms/experts, each ex- 
pert i having a prior weight w l such that the weights sum up to at most 1. Then 



the above described bFPL variant with entering times t % 



, exploration rate 



n ft = t 4 j an d learning rate r\ t = t satisfies the regret bound 

Ec^ L -cl :r <o((^) n + T§log^ 

for allT > 1, any adaptive assignment of the cost vectors and any reference expert i. 

The formal proof is omitted. It follows the outline of Theorem 2, using the arguments
discussed above. Many of the arguments, including the step from IFPL to the
reference expert, are formally carried out in [PH05].



6 Discussion 

The main statement of this paper is the following: 

    If we have a regret minimization algorithm with a bound guaranteed
    against an oblivious adversary, and if the algorithm chooses the
    current action/expert by some independent random sampling based
    on past cumulative scores (e.g. FPL or weighted majority), then the
    same bound also holds against an adaptive adversary. This is true
    both for full and partial observations.

We have used this argument for showing bounds for FPL in the adversarial bandit
problem. The strategy to use only feedback from exploration rounds, which is common
for FPL, achieves a regret bound of $O(t^{2/3})$. As the algorithm is label efficient, this
bound is sharp. Using all observations allows us to push the regret down to $O(\sqrt{t})$.
Then however the sampling probabilities have to be approximated.

In the same way, it is possible to use our argument for the general geometric
online optimization problem [MB04, AK04], also resulting in an $O(t^{2/3})$ regret bound
against an adaptive adversary. An interesting open problem is the following: Under
which conditions and how is it possible to use all observations in the geometric online
optimization problem, hopefully arriving at an $O(\sqrt{t})$ bound?

We conclude with a note on regret against an adaptive adversary. We considered
the external regret w.r.t. the best action/strategy/expert from a pool. There are two
directions from here. One is to go to different regret definitions, such as internal regret.
The other one is to change the reference and compare to the hypothetical performance
of the best strategy, in this way accepting a stronger type of dependency of the future
costs on the currently selected action (see e.g. [PH05] and the references therein).
It is one of the major open problems to propose refined algorithms and prove better
bounds in this model.